A Multi-label Text Classification Framework: Using Supervised and Unsupervised Feature Selection Strategy
Abstract
Text classification, the task of assigning metadata to documents, requires significant time and effort when performed by humans. Moreover, as online-generated content grows explosively, manually annotating large-scale unstructured data becomes a challenge. Many state-of-the-art text mining methods have been applied to the classification process, and many of them are based on keyword extraction. However, when these keywords are used as features in a classification task, the feature dimension is commonly huge. Selecting keywords from a very large number of documents as features is also a challenge, and the computation cost is high when traditional machine learning algorithms are applied to a large data set. In addition, almost 80% of real data is unstructured and unlabeled, so advanced supervised feature selection methods cannot be applied directly to select entities from such massive data. Statistical strategies are usually used to discover key features when extracting features from unlabeled data for classification tasks. In contrast, we propose a novel method that extracts important features effectively before they are fed into the classification task. Another challenge in text classification is the multi-label problem, the assignment of multiple non-exclusive labels to documents, which makes text classification more complicated than single-label classification. Considering these issues, we develop a framework for extracting features and reducing data dimensionality, and for solving the multi-label problem on labeled and unlabeled data sets. To reduce data dimension, we provide 1) a hybrid feature selection method that extracts meaningful features according to the importance of each feature; 2) the use of Word2Vec to represent each document with a lower feature dimension when categorizing documents in a big data set; and 3) an unsupervised approach to extract features from real online-generated data for text classification and prediction. To solve the multi-label classification task, we design a new Multi-Instance Multi-Label (MIML) algorithm in the proposed framework.
INDEX WORDS: Multi-label Text Classification, Feature Selection, Word2Vec, Natural Language Processing, Depression Symptoms, Social Media
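The abstract does not say how Word2Vec vectors are combined into a document representation. One common approach, shown here only as an illustrative sketch and not as the authors' method, is to average the vectors of the words in a document, so that every document has the same small dimension regardless of vocabulary size. The embedding table `TOY_VECTORS`, its values, and the helper `document_vector` are hypothetical; a real model would be trained on a corpus.

```python
from typing import Dict, List

# Hypothetical 3-dimensional embeddings, made up for illustration.
# A trained Word2Vec model would typically use far more dimensions.
TOY_VECTORS: Dict[str, List[float]] = {
    "sad": [0.9, 0.1, 0.0],
    "tired": [0.7, 0.3, 0.0],
    "happy": [0.0, 0.2, 0.8],
    "sleep": [0.5, 0.5, 0.0],
}


def document_vector(
    tokens: List[str], vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of the tokens found in `vectors`.

    Tokens without an embedding are skipped. If no token is known,
    a zero vector of length `dim` is returned.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Only "sad" and "tired" have embeddings here; the other tokens are skipped.
doc = "i feel sad and tired".split()
vec = document_vector(doc, TOY_VECTORS, dim=3)
print(len(vec), vec)
```

With this representation the feature dimension equals the embedding size rather than the number of distinct keywords, which is the kind of reduction the abstract refers to.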
Related Resources
MLIFT: Enhancing Multi-label Classifier with Ensemble Feature Selection
Multi-label classification has gained significant attention during recent years, due to the increasing number of modern applications associated with multi-label data. Despite its short life, different approaches have been presented to solve the task of multi-label classification. LIFT is a multi-label classifier which utilizes a new strategy to multi-label learning by leveraging label-specific ...
A New Framework for Distributed Multivariate Feature Selection
Feature selection is considered an important issue in the classification domain. Selecting good features by maximizing relevance to the class label and minimizing redundancy among features improves classification accuracy. However, most current feature selection algorithms work only with centralized methods. In this paper, we suggest a distributed version of the mRMR featu...
Distributed Clustering with Feature Selection for Text Documents Based on Ontology
Feature selection has been extensively used in supervised learning, such as text classification. It reduces the high dimensionality of the feature space (Devaney and Ram 1997) and also offers improved data understanding, which enhances the clustering result. The chosen feature set should contain adequate information about the original data set. It is believed that feature selection can enhance the...
Support Vector Machine Based Facies Classification Using Seismic Attributes in an Oil Field of Iran
Seismic facies analysis (SFA) aims to classify similar seismic traces based on amplitude, phase, frequency, and other seismic attributes. SFA has proven useful in interpreting seismic data, allowing significant information on subsurface geological structures to be extracted. While facies analysis has been widely investigated through unsupervised-classification-based studies, there are few cases...
READER: Robust Semi-Supervised Multi-Label Dimension Reduction
Multi-label classification is an appealing and challenging supervised learning problem, where multiple labels, rather than a single label, are associated with an unseen test instance. To remove possible noise in labels and high-dimensional features, multi-label dimension reduction has attracted more and more attention in recent years. The existing methods usually suffer from several pro...